Restructure Cosmos benchmark agent to 3 deterministic skills #48165
Draft
xinlian12 wants to merge 23 commits into Azure:main from
Add benchmark shell scripts for VM provisioning, setup, execution, monitoring, diagnostics capture, and dashboard generation. Update BenchmarkConfig, BenchmarkOrchestrator, and TenantWorkloadConfig to support multi-tenant benchmark orchestration with per-tenant configuration overrides. Add .gitignore entries for benchmark artifacts and Copilot skills. Add test-setup and test-results directory scaffolding with READMEs and a sample tenants.json template (no real credentials). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Agent routing file dispatches to 5 skills covering the full benchmark/DR drill lifecycle:
- provision: Cosmos DB accounts, App Insights, Azure VMs
- setup: JDK/Maven install, repo clone, config generation, build
- run: CHURN preset execution, multi-VM parallel, App Insights config
- analyze: CSV metrics, run comparison, heap/thread dumps, Kusto export
- status: resource health, run overview, App Insights verification
Also includes skill-creator utility for authoring new skills.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ate runtime config
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bf46a2f to
c43f5b6
Consolidate the benchmark agent from 5 skills down to 3, with deterministic script-driven flows replacing inline commands.
Skills:
- setup-resources: provision Azure infra (Cosmos DB, App Insights, VM) with parallel creation, capacity validation, region fallback, and verification gate
- run: clone/build/verify/execute benchmarks via single SSH session per ref, supports multiple refs for comparison, SIMPLE/EXPAND/CHURN presets
- analyze: download results to config-dir/results, generate markdown report with time-series SVG charts and multi-run comparison tables
Key changes:
- Rename provision -> setup-resources, merge setup into run, remove status
- .github/skills and .github/agents use symlinks to copilot/ (single source)
- Default region westus2, resource group rg-cosmos-benchmark-YYYYMMDD
- Config directory prompted with credential-in-repo warning
- provision-all.sh orchestrates parallel resource creation + verification
- vm-prepare-and-run.sh consolidates checkout/build/verify/run in 1 SSH session
- run-all-refs.sh loops over user-provided refs with per-ref result directories
- generate-report.py reads monitor.csv + metrics/*.csv, outputs report.md
- Remove parse_hprof.py, kusto-schema.md, generate-dashboard.py (deferred)
- Remove trigger-benchmark.sh (superseded by vm-prepare-and-run.sh)
- Merge setup-benchmark-vm.sh into provision-benchmark-vm.sh
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
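The default region and resource group naming from this commit could be produced along the following lines. This is only a sketch: the function names are hypothetical and are not taken from provision-all.sh.

```bash
# Illustrative helpers; the function names are assumptions, not from the PR's scripts.
default_region() {
  echo "westus2"
}

# Builds rg-cosmos-benchmark-YYYYMMDD. An explicit date stamp may be passed,
# for example to reproduce the group name of an earlier run.
default_resource_group() {
  local stamp="${1:-$(date +%Y%m%d)}"
  echo "rg-cosmos-benchmark-${stamp}"
}
```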
- Add timestamped progress logging to validate-capacity.sh
- Fix restriction detection to handle all types (Zone, NotAvailableForSubscription)
- Replace slow per-SKU API calls with single-call alternative SKU search
- Add --find-alternatives flag to control similar SKU search
- Add restriction_reason field to JSON output
- Derive quota family dynamically from effective SKU
- Add --fallback-regions flag to find-region.sh for user-specified regions
- Implement 4-phase search: preferred exact → preferred similar → fallback exact → fallback similar
- Add [N/M] progress updates printed as each region completes
- Add --stop-on-first flag (default: true)
- Fix integration bugs: JSON path, exit code logic, stdin-based parsing
- Update SKILL.md to document new flags and search strategy
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
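The 4-phase search order can be illustrated with a small sketch. It assumes stop-on-first behavior. The helpers has_exact_sku and has_similar_sku, and the EXACT_OK and SIMILAR_OK variables, are hypothetical stand-ins for the real capacity checks. They are not part of find-region.sh.

```bash
# Hypothetical stand-ins for the real capacity checks: a region "passes" when it
# appears in the space-separated EXACT_OK or SIMILAR_OK lists.
has_exact_sku() {
  case " ${EXACT_OK:-} " in *" $1 "*) return 0 ;; *) return 1 ;; esac
}
has_similar_sku() {
  case " ${SIMILAR_OK:-} " in *" $1 "*) return 0 ;; *) return 1 ;; esac
}

# find_region "<preferred regions>" "<fallback regions>"
# Prints "<region> <match-kind>" for the first hit and stops there.
find_region() {
  local preferred="$1" fallback="$2" region
  # Phase 1: preferred regions, exact SKU
  for region in $preferred; do
    if has_exact_sku "$region"; then echo "$region exact"; return 0; fi
  done
  # Phase 2: preferred regions, similar SKU
  for region in $preferred; do
    if has_similar_sku "$region"; then echo "$region similar"; return 0; fi
  done
  # Phase 3: fallback regions, exact SKU
  for region in $fallback; do
    if has_exact_sku "$region"; then echo "$region exact"; return 0; fi
  done
  # Phase 4: fallback regions, similar SKU
  for region in $fallback; do
    if has_similar_sku "$region"; then echo "$region similar"; return 0; fi
  done
  return 1
}
```

In this ordering, a similar SKU in a preferred region wins over an exact SKU in a fallback region.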
- Add capacity validation step before resource creation that blocks unless all checks pass (VM SKU, quota, Cosmos DB, App Insights)
- Add --skip-capacity-check flag to override the gate
- Add timestamped log() function for all progress messages
- Add elapsed time tracking per resource and total provisioning time
- Fix JSON parsing to match validate-capacity.sh output format
- Update SKILL.md to document new behavior and flag
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
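A minimal sketch of such a gate, with timestamped logging and elapsed-time tracking, might look like the following. The name=pass|fail argument format and the capacity_gate and timed helpers are assumptions made for illustration. The real script parses the JSON output of validate-capacity.sh instead.

```bash
# Timestamped progress logging, written to stderr so stdout stays clean.
log() {
  printf '[%s] %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >&2
}

# capacity_gate <skip:true|false> <name=pass|fail>...
# Returns 0 only if every check passed, or if the gate is explicitly skipped.
capacity_gate() {
  local skip="$1" check failed=0
  shift
  if [ "$skip" = true ]; then
    log "capacity check skipped"
    return 0
  fi
  for check in "$@"; do
    if [ "${check#*=}" != pass ]; then
      log "capacity check failed: ${check%%=*}"
      failed=1
    fi
  done
  [ "$failed" -eq 0 ]
}

# timed <label> <command...>: runs the command and logs its elapsed seconds.
timed() {
  local label="$1" start status
  shift
  start=$(date +%s)
  "$@"
  status=$?
  log "$label finished in $(( $(date +%s) - start ))s"
  return "$status"
}
```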
- Wrap benchmark execution in tmux session ('bench') on VM so the
process survives SSH disconnections
- Add async execution guidance to SKILL.md so the agent runs the
orchestrator in background mode, keeping the user's context free
- Use scenario-based poll intervals (2min for SIMPLE, 5min for
EXPAND/CHURN) instead of 10s fixed polling
- Expand monitoring section with local and VM-side status checks
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
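The scenario-based poll intervals could be expressed as a small lookup, shown here as a sketch. The function names are hypothetical. The wait loop assumes it runs on the VM, where the tmux session 'bench' lives.

```bash
# Poll interval in seconds per scenario preset (2 min SIMPLE, 5 min EXPAND/CHURN).
poll_interval_seconds() {
  case "$1" in
    SIMPLE) echo 120 ;;
    EXPAND | CHURN) echo 300 ;;
    *) echo "unknown scenario: $1" >&2; return 1 ;;
  esac
}

# Blocks until the tmux session 'bench' is gone, sleeping between checks.
# Assumes this runs where the tmux server lives (i.e. on the VM).
wait_for_bench() {
  local interval
  interval=$(poll_interval_seconds "$1") || return 1
  while tmux has-session -t bench 2>/dev/null; do
    sleep "$interval"
  done
}
```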
- Detect refs like 'xinlian12/branchName' by checking if the part before the first slash matches an existing git remote
- If remote exists, fetch from that remote; otherwise treat the slash as part of the branch name on origin
- Document fork branch format in SKILL.md ref examples
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
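The detection rule can be sketched as follows. resolve_ref is a hypothetical helper name. The list of remotes is passed in as text so the logic can be exercised without a git checkout.

```bash
# resolve_ref <ref> <newline-separated remote names>
# Prints "<remote> <branch>". The prefix before the first slash is treated as a
# remote only if it exactly matches one of the given remote names.
resolve_ref() {
  local ref="$1" remotes="$2" prefix
  case "$ref" in
    */*)
      prefix=${ref%%/*}
      if printf '%s\n' "$remotes" | grep -qxF -- "$prefix"; then
        echo "$prefix ${ref#*/}"
        return 0
      fi
      ;;
  esac
  # No matching remote: the whole ref is a branch name on origin.
  echo "origin $ref"
}
```

On a real checkout this might be called as resolve_ref "xinlian12/branchName" "$(git remote)".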
- Instruct agent to proactively verify the run is progressing after async launch; if the shell exits too quickly, investigate
- Add diagnosis steps: check results dirs, git state, JAR, tmux
- Document common failures table (checkout, build, startup, SSH)
- Require confirming with user before relaunching after a failure
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- New script checks tmux session, results directories (with per-run status), git state, build status, and optionally system resources
- Supports --run-name for run-specific details (monitor samples, metrics, disk usage) and --verbose for system resource info
- Updated SKILL.md to reference check-status.sh in monitoring and troubleshooting sections
- Fix SSH stdin consumption in while-read loop with -n flag
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- SCP vm-prepare-and-run.sh, run-benchmark.sh, monitor.sh, and capture-diagnostics.sh to ~/benchmark-scripts/ on the VM
- Execute remotely via 'bash ~/benchmark-scripts/vm-prepare-and-run.sh' instead of 'bash -s' stdin piping which broke heredocs
- Update vm-prepare-and-run.sh to reference co-located scripts from ~/benchmark-scripts/ in the tmux run script
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- run-all-refs.sh now only SCPs vm-prepare-and-run.sh (the bootstrapper) instead of all 4 scripts
- After checkout, vm-prepare-and-run.sh resolves scripts from the cloned repo (copilot/skills/.../scripts/) so they match the ref being benchmarked
- Falls back to ~/benchmark-scripts/ if the repo doesn't include the scripts yet (e.g., older branches)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- run-all-refs.sh: --force-copy-scripts copies ALL scripts to VM (not just the bootstrapper) and passes --force-scripts to the bootstrapper
- vm-prepare-and-run.sh: --force-scripts overrides repo-first resolution, using ~/benchmark-scripts/ (the SCP'd copies) instead
- Default behavior unchanged: repo scripts used after checkout
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
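Taken together with the previous commit, the script resolution order (repo first, SCP'd copies as fallback, forced override) might be sketched like this. resolve_scripts_dir is a hypothetical helper. Probing for run-benchmark.sh as the marker file is an assumption about how the presence of repo scripts is detected.

```bash
# resolve_scripts_dir <repo-scripts-dir> <fallback-dir> [force:true|false]
# Prefers the scripts from the checked-out repo so they match the ref under test.
# Uses the fallback directory when the repo lacks them or when forced.
resolve_scripts_dir() {
  local repo_scripts="$1" fallback="$2" force="${3:-false}"
  if [ "$force" != true ] && [ -f "$repo_scripts/run-benchmark.sh" ]; then
    echo "$repo_scripts"
  else
    echo "$fallback"
  fi
}
```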
- run-all-refs.sh now starts vm-prepare-and-run.sh inside a tmux session, so checkout, build, verify AND run all survive SSH disconnection
- vm-prepare-and-run.sh Step 4 simplified: runs run-benchmark.sh directly (no nested tmux, no .run.sh heredoc generation)
- Polling and exit code logic moved to run-all-refs.sh orchestrator
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Write a small /tmp/bench-launch.sh on the VM that wraps vm-prepare-and-run.sh and writes the exit code
- Avoids nested quoting issues (SSH -> tmux -> bash -> args)
- Fix stale EXIT_CODE_FILE variable reference
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
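The wrapper idea can be sketched as a function that generates a tiny launcher script. write_launcher and its parameters are hypothetical. The sketch also assumes that none of the paths contain single quotes.

```bash
# write_launcher <launcher-path> <target-script> <exit-code-file>
# Generates a script that runs the target and records its exit code in a file,
# so the caller only has to pass one simple path through SSH and tmux.
# Assumes none of the paths contain single quotes.
write_launcher() {
  local launcher="$1" target="$2" exit_file="$3"
  cat >"$launcher" <<EOF
#!/usr/bin/env bash
bash '$target' "\$@"
echo \$? > '$exit_file'
EOF
  chmod +x "$launcher"
}
```

Because the launcher takes no quoting-sensitive arguments of its own, it could then be started with something like tmux new-session -d -s bench /tmp/bench-launch.sh, and the orchestrator can read the exit-code file once the session ends.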
Use $HOME instead of ~ in double-quoted string to ensure correct path expansion when interpolated into SSH commands. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
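The underlying shell behavior is easy to demonstrate: tilde expansion does not happen inside double quotes, whereas $HOME is expanded. The two function names below are purely illustrative.

```bash
# Inside double quotes the tilde stays a literal character.
quoted_tilde_path() {
  echo "~/benchmark-scripts"
}

# $HOME is an ordinary variable and expands inside double quotes.
quoted_home_path() {
  echo "$HOME/benchmark-scripts"
}
```

Note that an unescaped $HOME in a double-quoted SSH command string is expanded by the local shell before the command is sent, so this pattern assumes the local and remote home directories match, or that the variable is escaped so the remote shell expands it.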
When run-benchmark.sh is executed from ~/benchmark-scripts/ (SCP'd copy), SCRIPT_DIR/../ doesn't point to the benchmark module. Fall back to PWD if the script's parent doesn't contain a target/ dir. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
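A sketch of that fallback, using a hypothetical helper name, could look like this:

```bash
# resolve_module_dir <script-dir>
# Uses the script's parent directory when it looks like the benchmark module
# (it contains target/). Otherwise falls back to the current working directory.
resolve_module_dir() {
  local script_dir="$1" parent
  parent=$(cd "$script_dir/.." && pwd) || return 1
  if [ -d "$parent/target" ]; then
    echo "$parent"
  else
    echo "$PWD"
  fi
}
```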
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…gDirectory
- --tenantsFile -> -tenantsFile (JCommander uses single dash)
- Remove --scenario and --outputDir (not valid Configuration params)
- Add -reportingDirectory for CSV metrics output
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace fire-and-forget async launch with a two-step workflow:
Step A: Launch orchestrator with sync mode (initial_wait: 60)
Step B: Mandatory verify via check-status.sh within 90s
Prevents the agent from telling the user 'it's running' without actually confirming tmux is alive and results directory exists.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
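The "verify within a deadline" step can be sketched as a generic retry helper. verify_within is a hypothetical name, and the retry interval is an assumption, since the commit only states the 90s window.

```bash
# verify_within <timeout-seconds> <retry-interval-seconds> <command...>
# Re-runs the command until it succeeds or the deadline passes.
# Returns 0 on the first success, 1 if the deadline is reached first.
verify_within() {
  local deadline interval="$2"
  deadline=$(( $(date +%s) + $1 ))
  shift 2
  while [ "$(date +%s)" -lt "$deadline" ]; do
    if "$@"; then
      return 0
    fi
    sleep "$interval"
  done
  return 1
}
```

Step B would then amount to something like verify_within 90 10 followed by the actual check-status.sh invocation, whose arguments are not shown here.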
Previously, createBenchmarks() initialized Cosmos clients sequentially in a for loop. With 50 tenants, each taking ~10-15s (connect + create DB/container + populate docs), initialization alone took ~8-10 minutes. Now submits all tenant initializations to the existing ExecutorService in parallel, collecting results via Future.get(). With 50 tenants on a 50-thread pool, initialization completes in ~15-20s instead. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This reverts commit 3639600.
Cosmos Benchmark Agent
Overview
Adds a Copilot-powered Cosmos DB benchmark agent that automates the full benchmark lifecycle: provisioning infrastructure, running benchmarks, and analyzing results. The agent is organized into 3 deterministic, script-driven skills with a clear sequential workflow:
How to Use (Copilot CLI)
1. Select the agent
From the Copilot CLI, use @ to select the cosmos-benchmark agent:
Or start a session and the agent will be auto-selected based on context when working under sdk/cosmos/azure-cosmos-benchmark/.
2. Example workflows
Full benchmark — provision, run, analyze:
Quick validation of a PR:
Check on a running benchmark:
Reuse existing infrastructure:
3. Key commands
- setup resources
- run benchmark on <refs>
- peek/check status
- analyze results
- capture diagnostics
Agent Structure
Skills
1. setup-resources: Provision Azure Infrastructure
Creates Cosmos DB accounts, Application Insights, and Azure VMs. Exports credentials to a config directory consumed by downstream skills.
Script flow:
Config directory outputs (consumed by run skill):
2. run: Build & Execute Benchmarks
Clones repo at specified branch/PR/commit, builds the benchmark JAR, and executes scenarios on the VM inside a tmux session for resilience against SSH disconnections.
Script flow:
Supports: multiple refs for comparison (main vs feature branch), scenario presets (SIMPLE ~30 min, EXPAND ~90 min, CHURN for leak detection), --force-copy-scripts to test local script changes.
3. analyze: Download & Report Results
Downloads results from the VM and generates comparison reports with pass/fail thresholds.
Script flow:
Benchmark Modes
The framework supports two modes — purely a configuration choice:
-tenantsFile tenants.json with multiple account configurations
Both use the same JAR, orchestrator, and monitoring infrastructure.
Key Design Decisions
- --force-copy-scripts overrides repo scripts with local versions for testing changes before they are merged
- check-status.sh to verify the benchmark is actually running before reporting success to the user
Additional Changes
skill-creator skill: a meta-skill for authoring new skills with proper structure and conventions